# Vision-Language Models

## GUI-Actor-2B-Qwen2-VL

**License:** MIT · **Tags:** Text-to-Image, Transformers · **Publisher:** microsoft · **Downloads:** 163 · **Likes:** 9

GUI-Actor-2B is a vision-language model based on Qwen2-VL-2B, designed specifically for graphical user interface (GUI) grounding tasks. With an added attention-based action head and fine-tuning, it performs well on multiple GUI grounding benchmarks.
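The entry above credits an attention-based action head for GUI grounding. As a loose illustration of the idea (not GUI-Actor's actual architecture), the hypothetical PyTorch sketch below attends over visual patch tokens and reads off a click point as the attention-weighted average of patch centers; all names, dimensions, and the pooling scheme are invented for the example.

```python
import torch
import torch.nn as nn

class AttentionActionHead(nn.Module):
    """Toy action head: attend over visual patch tokens, then
    predict a click point as the attention-weighted patch center."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_dim))  # learned action query
        self.scale = hidden_dim ** -0.5

    def forward(self, patch_states: torch.Tensor, patch_centers: torch.Tensor):
        # patch_states:  (num_patches, hidden_dim) hidden states of visual tokens
        # patch_centers: (num_patches, 2) normalized (x, y) center of each patch
        scores = patch_states @ self.query * self.scale         # (num_patches,)
        attn = torch.softmax(scores, dim=0)                     # attention over patches
        click_xy = (attn.unsqueeze(-1) * patch_centers).sum(0)  # expected click location
        return click_xy, attn

head = AttentionActionHead(hidden_dim=1536)
states = torch.randn(1024, 1536)  # e.g. hidden states for a 32x32 patch grid
ys, xs = torch.meshgrid(torch.linspace(0, 1, 32), torch.linspace(0, 1, 32), indexing="ij")
centers = torch.stack([xs.flatten(), ys.flatten()], dim=-1)
click, attn = head(states, centers)
print(click)  # predicted normalized (x, y) click coordinate
```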
## Dreamer-7B

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, English · **Publisher:** osunlp · **Downloads:** 62 · **Likes:** 3

WebDreamer is a planning framework that enables efficient and effective planning for real-world web-agent tasks.
## Gemma-3-27B-It-GGUF

**Tags:** Text-to-Image · **Publisher:** Mungert · **Downloads:** 4,034 · **Likes:** 6

A GGUF-quantized version of Gemma 3 with 27B parameters, supporting image-text interaction tasks.
## STEVE-R1-7B-SFT-i1-GGUF

**License:** Apache-2.0 · **Tags:** Text-to-Image, English · **Publisher:** mradermacher · **Downloads:** 394 · **Likes:** 0

A weighted/imatrix quantized version of the Fanbin/STEVE-R1-7B-SFT model, suitable for resource-constrained environments.
## Gemma-3-4B-It-GGUF

**Tags:** Image-to-Text · **Publisher:** ggml-org · **Downloads:** 9,023 · **Likes:** 25

Gemma 3 is a lightweight open multimodal model from Google that accepts text and image inputs and produces text outputs, with a 128K-token context window and support for over 140 languages.
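Both Gemma 3 GGUF entries above are quantized checkpoints meant for llama.cpp-style runtimes rather than transformers. Below is a minimal text-only sketch using the llama-cpp-python bindings; the model filename is a placeholder for whichever quant you downloaded, and image input additionally requires the model's mmproj projector file.

```python
from llama_cpp import Llama

# Path to a downloaded GGUF quant; the exact filename depends on the
# quantization level you chose (Q4_K_M here is an assumption).
llm = Llama(
    model_path="gemma-3-4b-it-Q4_K_M.gguf",
    n_ctx=8192,        # context to allocate (the model supports up to 128K)
    n_gpu_layers=-1,   # offload all layers if built with GPU support
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF quantization is."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```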
## Q-SiT

**License:** MIT · **Tags:** Image-to-Text, Transformers · **Publisher:** zhangzicheng · **Downloads:** 79 · **Likes:** 0

Q-SiT Mini is a lightweight image quality assessment and dialogue model focused on image quality analysis and scoring.
## Llama-3.2-11B-Vision-Electrical-Components-Instruct

**License:** MIT · **Tags:** Image-to-Text, English · **Publisher:** ankitelastiq · **Downloads:** 22 · **Likes:** 1

Llama 3.2 11B Vision Instruct is a multimodal model combining vision and language, supporting image-to-text tasks.
## LLaVA-NeXT-Video-7B-hf

**Tags:** Video-to-Text, English · **Publisher:** FriendliAI · **Downloads:** 30 · **Likes:** 0

LLaVA-NeXT-Video-7B-hf is a video multimodal model that processes video and text inputs to generate text outputs.
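For reference, LLaVA-NeXT-Video checkpoints can be driven through the transformers video classes. The sketch below assumes the llava-hf mirror of this checkpoint and feeds random frames in place of a properly sampled clip.

```python
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed mirror of this checkpoint
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# A real pipeline would sample frames from a video file;
# random frames keep the sketch self-contained.
frames = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

prompt = "USER: <video>\nDescribe what happens in this video. ASSISTANT:"
inputs = processor(text=prompt, videos=frames, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```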
## Libra-LLaVA-Med-v1.5-Mistral-7B

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers · **Publisher:** X-iZhang · **Downloads:** 180 · **Likes:** 1

LLaVA-Med is an open-source large vision-language model tailored to biomedical applications. Built on the LLaVA framework, it is enhanced through curriculum learning and fine-tuned for open-ended biomedical question answering.
## Florence-2-Base-Castollux-v0.4

**Tags:** Image-to-Text, Transformers, English · **Publisher:** PJMixers-Images · **Downloads:** 23 · **Likes:** 1

An image captioning model fine-tuned from microsoft/Florence-2-base, focused on improving caption quality and formatting.
## LLaVA-Llama3

**Tags:** Image-to-Text · **Publisher:** chatpig · **Downloads:** 360 · **Likes:** 1

LLaVA-Llama3 is a multimodal model based on Llama 3 that supports joint processing of images and text.
## UI-TARS-7B-DPO

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** ByteDance-Seed · **Downloads:** 38.74k · **Likes:** 206

UI-TARS is a next-generation native GUI agent model, designed to interact seamlessly with graphical user interfaces through human-like perception, reasoning, and action.
## UI-TARS-2B-SFT

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** bytedance-research · **Downloads:** 5,792 · **Likes:** 19

UI-TARS is a next-generation native GUI agent model, designed to interact seamlessly with graphical user interfaces through human-like perception, reasoning, and action.
## UI-TARS-2B-SFT

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** ByteDance-Seed · **Downloads:** 5,553 · **Likes:** 19

UI-TARS is a next-generation native GUI agent model, designed to interact seamlessly with graphical user interfaces through human-like perception, reasoning, and action.
## DeQA-Score-Mix3

**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Publisher:** zhiyuanyou · **Downloads:** 4,177 · **Likes:** 2

DeQA-Score-Mix3 is a no-reference image quality assessment model fine-tuned from the MAGAer13/mplug-owl2-llama2-7b base model, with strong performance across multiple datasets.
## ColQwen2-7B-v1.0

**Tags:** Text-to-Image, English · **Publisher:** yydxlv · **Downloads:** 25 · **Likes:** 1

A visual retrieval model based on Qwen2-VL-7B-Instruct and the ColBERT strategy, supporting multi-vector representations of text and images.
## VideoChat-TPO

**License:** MIT · **Tags:** Text-to-Video, Transformers · **Publisher:** OpenGVLab · **Downloads:** 18 · **Likes:** 5

A multimodal large language model developed from the paper "Task Preference Optimization: Improving Multimodal Large Language Models through Visual Task Alignment".
## Olympus

**License:** Apache-2.0 · **Tags:** Text-to-Image, Transformers, English · **Publisher:** Yuanze · **Downloads:** 231 · **Likes:** 2

Olympus is a universal task-routing system for computer vision that handles 20 different visual tasks, achieving efficient multi-task processing through its task-routing mechanism.
## LLaVA-Critic-7B-hf

**Tags:** Text-to-Image, Transformers · **Publisher:** FuryMartin · **Downloads:** 21 · **Likes:** 1

A transformers-compatible vision-language model with image understanding and text generation capabilities.
## BLIP Radiology Model

**Tags:** Image-to-Text, Transformers · **Publisher:** daliavanilla · **Downloads:** 16 · **Likes:** 0

BLIP is a Transformer-based image captioning model that generates natural-language descriptions for input images.
## ViT-GPT2 Image Captioning Model

**Tags:** Image-to-Text, Transformers · **Publisher:** motheecreator · **Downloads:** 142 · **Likes:** 0

An image captioning model based on the ViT-GPT2 encoder-decoder architecture that converts input images into descriptive text.
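The two captioning entries above (BLIP-based and ViT-GPT2) follow the same encode-image, decode-text pattern. A minimal sketch of that pattern, using the widely used nlpconnect ViT-GPT2 checkpoint as a stand-in for the listed models:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Well-known ViT-GPT2 captioning checkpoint, used here as a stand-in;
# the models listed above follow the same encoder-decoder pattern.
model_id = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")  # any local image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    output_ids = model.generate(pixel_values, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```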
## ColQwen2-v0.1

**License:** Apache-2.0 · **Tags:** Text-to-Image, English · **Publisher:** vidore · **Downloads:** 21.25k · **Likes:** 170

A visual retrieval model based on Qwen2-VL-2B-Instruct and the ColBERT strategy, able to index documents efficiently from their visual features.
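Both ColQwen2 entries rely on ColBERT-style late interaction: queries and pages are embedded as bags of vectors, and relevance is the sum, over query vectors, of each one's best match among document vectors. A minimal, model-agnostic sketch of that MaxSim scoring, with random embeddings standing in for real ColQwen2 outputs:

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query vector, take its
    maximum similarity over all document vectors, then sum."""
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # best document match per query token

# Random multi-vector embeddings stand in for real model outputs.
query = F.normalize(torch.randn(20, 128), dim=-1)
pages = [F.normalize(torch.randn(700, 128), dim=-1) for _ in range(3)]

scores = torch.stack([maxsim_score(query, p) for p in pages])
print("best page:", scores.argmax().item())
```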
## CogFlorence-2.2-Large

**License:** MIT · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** thwri · **Downloads:** 20.64k · **Likes:** 33

A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with captions generated by THUDM/cogvlm2-llama3-chat-19B; suited to image-to-text tasks.
## Lumina-mGPT-7B-512

**Tags:** Text-to-Image · **Publisher:** Alpha-VLLM · **Downloads:** 1,185 · **Likes:** 4

Lumina-mGPT is a family of multimodal autoregressive models that excel at a range of vision-language tasks, particularly generating flexible, photorealistic images from text descriptions.
## CogFlorence-2-Large-Freeze

**License:** MIT · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** thwri · **Downloads:** 419 · **Likes:** 14

A fine-tuned version of microsoft/Florence-2-large, trained on a 38,000-image subset of the Ejafa/ye-pop dataset with CogVLM2-generated captions, focused on image-to-text tasks.
## ViT-Base-Patch16-224-DistilGPT2

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers · **Publisher:** tarekziade · **Downloads:** 17 · **Likes:** 0

DistilViT is an image captioning model pairing a Vision Transformer (ViT) encoder with a distilled GPT-2 decoder to convert images into textual descriptions.
## TiC-CLIP-Bestpool-Sequential

**License:** Other · **Tags:** Text-to-Image · **Publisher:** apple · **Downloads:** 280 · **Likes:** 0

TiC-CLIP is a vision-language model trained on the TiC-DataComp-Yearly dataset, using continual-learning strategies to keep the model in sync with the latest data.
## TiC-CLIP-Bestpool-Oracle

**License:** Other · **Tags:** Text-to-Image · **Publisher:** apple · **Downloads:** 44 · **Likes:** 0

TiC-CLIP is an improved OpenCLIP-based vision-language model focused on time-continual learning, with training data spanning 2014 to 2022.
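Like other CLIP variants, the TiC-CLIP models score text-image pairs by embedding both into a shared space and comparing similarities. A generic sketch with the original OpenAI CLIP checkpoint as a stand-in (the TiC-CLIP weights themselves are OpenCLIP-format and would be loaded through the open_clip library instead):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint used as a stand-in for the listed variants.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg").convert("RGB")
texts = ["a photo of a cat", "a photo of a dog", "a screenshot of a website"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds cosine similarities scaled by the learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```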
## LLaVA-Phi-3-mini-4k-instruct

**License:** MIT · **Tags:** Image-to-Text, Transformers · **Publisher:** MBZUAI · **Downloads:** 550 · **Likes:** 22

A vision-language model that combines the Phi-3-mini-3.8B large language model with LLaVA v1.5, providing advanced vision-language understanding capabilities.
## LLaVA-Phi-3-mini-GGUF

**Tags:** Image-to-Text · **Publisher:** xtuner · **Downloads:** 1,676 · **Likes:** 133

LLaVA-Phi-3-mini is a LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, specializing in image-to-text tasks.
## VLRM-BLIP2-OPT-2.7B

**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Publisher:** sashakunitsyn · **Downloads:** 398 · **Likes:** 17

A BLIP-2 OPT-2.7B model fine-tuned with reinforcement learning, capable of generating long, detailed image descriptions.
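BLIP-2 checkpoints of this family generate captions through the standard transformers API. A minimal sketch with the base Salesforce checkpoint; the RL fine-tune above shares the architecture, so swapping in its repo id should load the same way (an assumption, not verified here).

```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Base BLIP-2 OPT-2.7B checkpoint the fine-tune above builds on.
model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)

generated = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```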
## BLIP-Finetuned-Fashion

**License:** BSD-3-Clause · **Tags:** Text-to-Image, Transformers · **Publisher:** Ornelas · **Downloads:** 2,281 · **Likes:** 0

A visual question answering model fine-tuned from Salesforce/blip-vqa-base, specialized for the fashion domain.
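Visual question answering with a BLIP VQA model takes an image plus a question and decodes a short answer. A sketch with the base checkpoint this fashion model was fine-tuned from; the image filename and question are placeholders.

```python
import torch
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

# The base checkpoint named in the entry above; the fine-tune
# loads identically via its own repo id.
model_id = "Salesforce/blip-vqa-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForQuestionAnswering.from_pretrained(model_id)

image = Image.open("outfit.jpg").convert("RGB")
question = "What color is the jacket?"

inputs = processor(images=image, text=question, return_tensors="pt")
with torch.no_grad():
    answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```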
## Thai-TrOCR-ThaiGov-V2

**Tags:** Image-to-Text, Transformers, Other · **Publisher:** kkatiz · **Downloads:** 339 · **Likes:** 13

A Thai handwriting recognition model built on a vision encoder-decoder architecture, suitable for a range of Thai OCR tasks.
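TrOCR-style models transcribe one cropped text line at a time through the vision encoder-decoder API. A sketch using Microsoft's English handwriting checkpoint as a stand-in; a Thai model like the one above would be loaded the same way by its repo id.

```python
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# English handwriting checkpoint used as a stand-in for the Thai model.
model_id = "microsoft/trocr-base-handwritten"
processor = TrOCRProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

# TrOCR expects a cropped image of a single text line.
line_image = Image.open("line.png").convert("RGB")
pixel_values = processor(images=line_image, return_tensors="pt").pixel_values

with torch.no_grad():
    ids = model.generate(pixel_values, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```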
## InfiMM-HD

**Tags:** Image-to-Text, Transformers, English · **Publisher:** Infi-MM · **Downloads:** 17 · **Likes:** 27

InfiMM-HD is a high-resolution multimodal model that understands and generates content combining images and text.
## TeCoA2-CLIP

**License:** MIT · **Tags:** Text-to-Image · **Publisher:** chs20 · **Downloads:** 53 · **Likes:** 1

A vision-language model initialized from OpenAI CLIP and adversarially fine-tuned on ImageNet for improved robustness.
## FARE4-CLIP

**License:** MIT · **Tags:** Text-to-Image · **Publisher:** chs20 · **Downloads:** 45 · **Likes:** 1

A vision-language model initialized from OpenAI CLIP, with robustness improved through unsupervised adversarial fine-tuning.
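The two robust CLIP variants above come from adversarial fine-tuning: an inner attack perturbs training images within a small L∞ ball, and the encoder is then updated to keep its embeddings stable under that perturbation. Below is a schematic PGD step in that spirit; a toy encoder replaces CLIP, and the loss and hyperparameters are illustrative, not those used for TeCoA or FARE.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a CLIP image encoder.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
opt = torch.optim.SGD(encoder.parameters(), lr=1e-3)

images = torch.rand(8, 3, 32, 32)
eps, step, n_steps = 4 / 255, 1 / 255, 3  # illustrative PGD budget

with torch.no_grad():
    clean_emb = encoder(images)           # embeddings to stay close to

# Inner loop: PGD finds a perturbation that maximally distorts the embeddings.
delta = torch.zeros_like(images, requires_grad=True)
for _ in range(n_steps):
    adv_emb = encoder(images + delta)
    loss = F.mse_loss(adv_emb, clean_emb)
    loss.backward()
    with torch.no_grad():
        delta += step * delta.grad.sign()  # ascend the embedding-distance loss
        delta.clamp_(-eps, eps)            # stay inside the L-inf ball
    delta.grad.zero_()

# Outer step: train the encoder to resist the perturbation just found.
opt.zero_grad()
robust_loss = F.mse_loss(encoder(images + delta.detach()), clean_emb)
robust_loss.backward()
opt.step()
```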
## InternLM-XComposer2-7B-4bit

**License:** Other · **Tags:** Image-to-Text, Transformers · **Publisher:** internlm · **Downloads:** 74 · **Likes:** 10

InternLM-XComposer2 is a vision-language large model (VLLM) based on InternLM2, featuring advanced image-text understanding and composition capabilities.
## InternLM-XComposer2-VL-7B-4bit

**License:** Other · **Tags:** Image-to-Text, Transformers · **Publisher:** internlm · **Downloads:** 1,635 · **Likes:** 27

A vision-language large model based on InternLM2, with outstanding image-text understanding and composition capabilities.
## Quilt-LLaVA-v1.5-7B

**Tags:** Text-to-Image, Transformers · **Publisher:** wisdomik · **Downloads:** 618 · **Likes:** 6

Quilt-LLaVA is an open-source chatbot fine-tuned from LLaMA/Vicuna on multimodal instruction-following data generated with GPT from histopathology educational videos.
## MoE-LLaVA-Qwen-1.8B-4e

**License:** Apache-2.0 · **Tags:** Text-to-Image, Transformers · **Publisher:** LanguageBind · **Downloads:** 176 · **Likes:** 14

MoE-LLaVA is a large vision-language model built on a Mixture-of-Experts architecture, achieving efficient multimodal learning by sparsely activating only a subset of its parameters.
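The "sparse activation" in the MoE-LLaVA entry refers to top-k expert routing: a router scores all experts per token, but only the k best actually run. A minimal, generic sketch of such a layer (not MoE-LLaVA's implementation; names and sizes are invented):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse MoE layer: a router picks k experts per token,
    and only those experts are evaluated."""

    def __init__(self, dim: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)    # routing probabilities
        weights, idx = gates.topk(self.k, dim=-1)    # keep top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                        # tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

layer = TopKMoE(dim=64)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```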